repository: make check and delete work at N=1 with sha256 pack ids#9783
repository: make check and delete work at N=1 with sha256 pack ids#9783mr-raj12 wants to merge 1 commit into
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #9783 +/- ##
==========================================
- Coverage 84.88% 84.84% -0.05%
==========================================
Files 92 92
Lines 15172 15166 -6
Branches 2275 2281 +6
==========================================
- Hits 12879 12867 -12
- Misses 1589 1594 +5
- Partials 704 705 +1 ☔ View full report in Codecov by Harness. |
28c817b to
f2ca1ff
Compare
A pack is named sha256(pack_bytes), not chunk_id. delete becomes a no-op (real removal is store_delete at the pack level); list iterates the chunk index; build_chunkindex_from_repo reads pack headers for real chunk_ids and gains init_flags so compact computes usage; tests drop a chunk's pack.
f2ca1ff to
4121509
Compare
| chunk index, since the pack file name is the pack_id, which need not equal the chunk_id. | ||
| """ | ||
| entry = repository.chunks.get(id) | ||
| repository.store_delete("packs/" + bin_to_hex(entry.pack_id)) |
There was a problem hiding this comment.
if entry is not None: ... else: ...
| # delete is a no-op now (see Repository.delete), so the object stays retrievable. | ||
| repository.delete(key50) | ||
| with pytest.raises(Repository.ObjectNotFound): | ||
| repository.get(key50) | ||
| assert pdchunk(repository.get(key50)) == b"SOMEDATA" |
There was a problem hiding this comment.
Maybe rather don't test Repository.delete at all than testing some behaviour that doesn't make sense.
| # N=1: the defect chunk is alone in its pack; drop the pack. N>1 needs compaction. | ||
| pack_id = self.chunks[defect_chunk].pack_id | ||
| del self.chunks[defect_chunk] | ||
| self.repository.delete(defect_chunk) | ||
| self.repository.store_delete("packs/" + bin_to_hex(pack_id)) |
There was a problem hiding this comment.
what part of "don't do actions that are very harmful for the general case" didn't you understand?
you are killing a complete pack here because one chunk was problematic.
| for info in repository.store_list("packs"): | ||
| pack_id = hex_to_bin(info.name) | ||
| pack = repository.store_load("packs/" + info.name) | ||
| for chunk_id, obj_offset, obj_size in RepoObj.iter_object_headers(pack): | ||
| num_chunks += 1 | ||
| chunks[chunk_id] = ChunkIndexEntry( | ||
| flags=init_flags, size=0, pack_id=pack_id, obj_offset=obj_offset, obj_size=obj_size | ||
| ) |
There was a problem hiding this comment.
This will read all the data that is stored in all packs.
There needs to be at least a TODO to improve efficiency by only loading the RepoObj headers from the packfile and not also all metadata and data.
There was a problem hiding this comment.
This is also bit tricky considering how the borgstore internal caching works right now (IF the Store is configured with a cache): on first access, it will load the complete pack. So even if borg would only read little from the pack, it would fetch the whole pack anyway.
| flags=init_flags, size=0, pack_id=pack_id, obj_offset=obj_offset, obj_size=obj_size | ||
| ) | ||
| else: | ||
| # Legacy repo: list() reads its own segment index (no recursion). get() routes through that |
There was a problem hiding this comment.
That "segment index" used to be called "repository index" in borg1.
There was a problem hiding this comment.
Is that legacy part even used? By what?
| return data[hdr_size + hdr.meta_size :] # crypted data | ||
|
|
||
| @classmethod | ||
| def iter_object_headers(cls, pack: bytes): |
There was a problem hiding this comment.
Problematic: API requires giving the complete pack contents.
Could do partial loads to only load the interesting parts of the pack and skip 99% of the data it does not need.
| if key <= last_key_checked: # needs sorted keys | ||
| continue | ||
| try: | ||
| obj = self.store.load(key) |
There was a problem hiding this comment.
keeping the old name here is problematic. obj would make readers think this is a repoobj, while in reality it is a complete pack.
| info = dict(id=self.id, version=self.version) | ||
| return info | ||
|
|
||
| def check(self, repair=False, max_duration=0): |
There was a problem hiding this comment.
guess Repository.check rather needs a rewrite than the little changes you do here.
context:
borg1 check --repository-only for a remote (ssh:) repository ran the Repository code ON THE REPO SERVER. Thus, reading all data was just local file I/O on the server and acceptable.
borg2 does most things client-side by default and the remoting happens inside borgstore - the Repository class code is NOT running on the repo server, but always on the client. Thus, reading all data from all packs is not acceptable because it would transfer everything over the network. Some borgstore code IS running on the repo server though (at least for "rest:" repos), e.g. Store.hash can be done repo-server-side.
For borg2 check --repository-only we could do it like this:
- iterate over all pack ids
- use store.hash(packname) to compute the hash from the data ON THE REPO SERVER
- compare computed hash with pack id
- if it matches, the pack does not have corruption.
- if it does not match, the pack is corrupted. without --repair just report that. with --repair we need to fetch it, iterate over all repoobjs and parse them into meta and data (this decrypts and find corrupted repoobjs because the AEAD authentication+decryption fails in the authentication step). drop corrupted repoobjs, write good repoobjs into new pack, update the index.
The index could be verified in a similar way, just comparing hashes. If there is a mismatch, emit a warning so that the user will rebuild the index.
Guess borg check without --repair would not rebuild an index by default (too expensive/slow), but just assume that the index is ok IF the hashes are ok.
There was a problem hiding this comment.
Does not necessarily need to be in this PR, you decide.
| # <marker> ourselves; index order is stable unless the index is mutated, which is all the | ||
| # marker pagination needs. | ||
| self._lock_refresh() | ||
| collect = marker is None |
There was a problem hiding this comment.
Yeah, that is a better expression than the original code.
|
Good, failures down to 6 in sha256-packid mode. \o/
Maybe check if that is a indexing problem (index not updated after put(chunkid, corrupted_data)) or if there is something wrong in the check code still. Hmm, maybe this is the wrong way to test that. The idea of that test was to simulate a corrupted file chunk (like bit rot), but that put() also writes it to a new packid and also it does a put for a chunkid we already have in the repo (that is an action that borg usually never does). Guess it would be better and also more realistic to low-level corrupt the chunk data inside the pack without using put() and without touching the index. |
|
BTW, if you address review feedback, can you rather do new commits than amending please - that is easier for review. At the end, git rebase -i can be used to squash stuff together. |
Description
borg check,deleteandcompactall assumed that a pack file's name is the chunk_id it holds. That is true only at N=1 while pack ids equal chunk ids. WithBORG_TESTONLY_SHA256_PACK_ID=1(and later at N>1) a pack is namedsha256(pack_bytes), so the file name is the pack_id, not the chunk_id.The object header already carries the chunk_id, meta size and data size (
repoobj.py), so the chunk index can be rebuilt from the headers (real chunk_ids) instead of the file names, and removal can find the pack through the index.Changes:
repoobj.py: addRepoObj.iter_object_headers(), which walks a pack's object headers and yields(chunk_id, obj_offset, obj_size). Works for one object per pack (N=1) and for several (N>1).repository.py:delete()becomes a logged no-op. A pack may hold several objects, so we can not remove one object by dropping its pack without losing the others; real removal goes throughstore_deleteat the pack level.list()iterates the chunk index and yields chunk_ids, instead of listingpacks/(which would yield pack_ids).cache.py:build_chunkindex_from_repo()rebuilds the index fromstore_list+iter_object_headers(real chunk_id/offset/size) rather than callingRepository.list(), which would recurse into the very index it is building. Adds aninit_flagsargument socompactcan start the chunks as unused and compute usage itself.compact_cmd.py:--statsbuilds the index from a real header scan instead of apack_id == chunk_idfake; the delete loop drops each unused chunk's pack viastore_delete.archive.py/debug_cmd.py:check --repairanddebug delete-objresolve the pack via the chunk index and thenstore_deleteit;debug delete-objmaps a missing pack (stale index entry) back to "not found".delete_chunkhelper drops a chunk's pack at the store level (used by the check / extract / mount damage tests), sinceRepository.deleteno longer removes anything;test_empty_repositorydrops packs through the store; test objects get a unique chunk_id so identical payloads do not collapse into one pack under sha256.Verified with the local test subset, with and without
BORG_TESTONLY_SHA256_PACK_ID=1: the repository- and archive-level missing-chunk checks pass under the switch and nothing regresses without it. Archive--verify-datacorruption detection under sha256 pack ids (a fewcheck --verify-datatests) is left for a separate change.refs #8572
Checklist
master